Regionalisation with Spatially Constrained Cluster Analysis

Published

December 5, 2022

case study : Regionalisation by Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods.

1. OVERVIEW

Regionalisation with Spatially Constrained clustering analysis requires similar observations to be grouped according to their statistical attributes and spatial location.

1.1 Objectives

  • Total number of water points by status, i.e. functional, non-functional, and unknown;

  • Percentage of water points by :

    • status (functional, non-functional, and unknown);

    • deployed water technology (hand pump, mechanical pump, stand tap, etc.) ;

    • usage capacity (1000, 300, 250, 50);

    • rural or urban.

1.2 Study Area Introduction

Alpha-3 Code : NGA

Population : 225 million (1st in Africa, 6th globally)

Local Government Areas (LGA) : 774

Water Point Observations : 95,008

Environmental Aspects :

  • Geography :

    • Southwest - “rugged” highland.

    • Southeast - hills and mountains, which form the Mambilla Plateau, the highest plateau in Nigeria.

  • Hydrology :

    • Two (2) main catchment areas - Chad Basin & Niger catchment area.

    • Surface area of lake Chad is shrinking recent decades due to irrigation activities.1

    • Untreated wastes dump in places resulted in waterways and groundwater pollution.2

  • Vegetation Coverage :

    • Lost nearly 80% of primary forest by 2012.3

    • States with dense forests concentrated : Bayelsa, Cross River, Edo, Ekiti, Ondo, Osun, Rivers, and Taraba.

  • 1 Wikipedia. Nigeria. https://en.wikipedia.org/wiki/Nigeria

  • 2 Ogbonna, D.N., Ekweozor, I.K.E., Igwe, F.U. (2002). “Waste Management: A Tool for Environmental Protection in Nigeria”. Ambio: A Journal of the Human Environment. 31 (1): 55–57. doi:10.1639/0044-7447(2002)031[0055:wmatfe]2.0.co;2.

  • 3 https://rainforests.mongabay.com/20nigeria.htm

  • 1.3 Scope of Works

    • import the shapefile into R with the appropriate sf method, and save it in a simple feature data frame format;
    Note

    Three (3) Projected Coordinate Systems of Nigeria, EPSG : 26391, 26392, and 26303.

    • derive the proportion of functional and non-functional water points at LGA level (i.e. ADM2) by appropriate tidyr and dplyr methods;

    • combine geospatial and aspatial data frames into a simple feature data frame.

    • delineate water points measures functional regions by using :

      • conventional hierarchical clustering.

      • spatially constrained clustering algorithms.

    • plot two (2) main types of maps below :

      Thematic Mapping

      Show the derived water-point measures by appropriate statistical graphics and choropleth mapping technique.

      Analytical Mapping

      Plot delineated functional regions using non-spatially constrained and spatially constrained clustering algorithms.

    2. R PACKAGE REQUIRED

    The following are the packages required for this exercise :

    2.1 Load R Packages

    Usage of the code chunk below :

    p_load( ) - pacman - to load packages. This function will attempt to install the package from CRAN or pacman repository list if its found not installed.

    Show the code
    pacman::p_load(sf, tidyverse, questionr, janitor, psych, ggplot2, gcookbook, tmap, ggpubr, egg, corrplot, gtsummary, regclass, caret, heatmaply, ggdendro, cluster, factoextra, spdep, ClustGeo, GGally, skimr, stringr, funModeling, knitr, caTools, viridis, rgeoda)

    3. GEOSPATIAL DATA

    3.1 Acquire Data Source

    Note

    The file size of the downloaded data is about 422 MB due to water points data from multiple countries.

    • Such file size may require extra effort and time to manage the code chunks and files in the R environment before pushing them to GitHub.

    Hence, to avoid any error in pushing files larger than 100 MB to Git, filtered Nigeria water points and removed unnecessary variables before uploading into the R environment.

    Therewith, the CSV file size should be lesser than 100 MB.

  • 4 Runfola, D. et al. (2020) geoBoundaries: A global database of political administrative boundaries. PLoS ONE 15(4): e0231866. https://doi.org/10.1371/journal.pone.0231866

  • 3.2 Import Attribute Data

    Four (4) data frames to be created for different context, i.e.

    • wp_coord = coordinated related variables.

    • wp_cond = status and conditions related variables.

    • wp_adm = administrative and measurements related variables.

    • wp = master file that combine all three (3) data frames above.

    3.2.4 Create Master File

    Usage of the code chunk below :

    left_join( ) - dplyr - to combine wp_coord, wp_cond and wp_adm.

    Show the code
    wp <- left_join(
      (left_join
       (wp_coord,wp_cond,
         by = c("row_id")
         )),
      wp_adm, 
      by = c("row_id"))

    3.2.5 Convert Well Known Text (WKT) Data to SF Data Frame

    • The “New Georeferenced Column” in wp_rds contains spatial data in a WKT format.

    • Two (2) steps to convert the WKT data format into an sf data frame.

    3.2.5.1 derive new field :: “geometry”

    Usage of the code chunk below :

    st_as_sfc( ) - sf - to convert foreign geometry object `New Georeferenced Column` to an sfc object

    Show the code
    wp$geometry = st_as_sfc(wp$`New Georeferenced Column`)

    3.2.5.2 convert to SF Data Frame

    Usage of the code chunk below :

    st_sf( ) - sf - to convert the tibble data frame into sf data frame with crs first set to WGS 84 (EPSG : 4326).

    st_crs( ) - sf - to retrieve coordinate reference system from the object.

    Show the code
    wp_sf<- st_sf(wp, crs = 4326)
    st_crs(wp_sf)
    Coordinate Reference System:
      User input: EPSG:4326 
      wkt:
    GEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        CS[ellipsoidal,2],
            AXIS["geodetic latitude (Lat)",north,
                ORDER[1],
                ANGLEUNIT["degree",0.0174532925199433]],
            AXIS["geodetic longitude (Lon)",east,
                ORDER[2],
                ANGLEUNIT["degree",0.0174532925199433]],
        USAGE[
            SCOPE["Horizontal component of 3D system."],
            AREA["World."],
            BBOX[-90,-180,90,180]],
        ID["EPSG",4326]]

    3.3 Import Boundary Data

    Usage of the code chunk below :

    st_read( ) - sf - to read simple features.

    select( ) - dplyr - to select “shapeName” variable.

    Show the code
    bdy_nga <- st_read(dsn = "data/geospatial",
                   layer = "geoBoundaries-NGA-ADM2",
                   crs = 4326) %>%
      select(shapeName)
    
    problems(bdy_nga)

    3.3.1 save and read RDS file :: wp_adm

    Show the code
    write_rds(bdy_nga,"data/geodata/bdy_nga.rds",compress = "xz")
    bdy_nga <- read_rds("data/geodata/bdy_nga.rds")

    3.3.2 Review Imported File

    3.3.2.1 check for missing and duplicated data

    Show the code
    skim(bdy_nga)
    Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No
    user-defined `sfl` provided. Falling back to `character`.
    Data summary
    Name bdy_nga
    Number of rows 774
    Number of columns 2
    _______________________
    Column type frequency:
    character 2
    ________________________
    Group variables None

    Variable type: character

    skim_variable n_missing complete_rate min max empty n_unique whitespace
    shapeName 0 1 3 18 0 768 0
    geometry 0 1 878 33370 0 774 0

    Remarks :

    • There is no missing data.

    • There 774 unique “geometry” but only 768 unique “shapeName”

      • That’s mean there are 6 values of “shapeName” duplicated among the identified unique shapeNames.

    3.3.2.2 list the unique “shapeName” associated with duplication

    Usage of the code chunk below :

    add_count( ) - dplyr - to count observations by group

    filter( ) - dplyr - to retain shapeName that has count not equal to 1.

    Show the code
    dupl_shapeName <- bdy_nga %>%
      add_count(bdy_nga$shapeName) %>%
      filter(n!=1) %>%
      select(-n)
    
    freq(dupl_shapeName$shapeName)
    Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
    of ggplot2 3.3.4.
    ℹ The deprecated feature was likely used in the funModeling package.
      Please report the issue at <https://github.com/pablo14/funModeling/issues>.

           var frequency percentage cumulative_perc
    1    Bassa         2      16.67           16.67
    2 Ifelodun         2      16.67           33.34
    3 Irepodun         2      16.67           50.01
    4 Nasarawa         2      16.67           66.68
    5      Obi         2      16.67           83.35
    6 Surulere         2      16.67          100.00

    3.3.2.3 verify findings in section 3.3.1.2

    Usage of the code chunk below :

    tmap_mode( ) - tmap - to set tmap mode to static plotting or interactive.

    tm_shape( ) - tmap - to specify the shape object.

    tm_polygons( ) - tmap - to fill the polygons and draw the polygon borders.

    tm_view( ) - tmap - to set the options for the interactive tmap viewer.

    tm_fill( ) - tmap - to specify either which colour to be used or which data variable mapped to the colour palette.

    tm_borders( ) - tmap - to draw the polygon borders.

    tmap_style( ) - tmap - to set the tmap style.

    tm_layout( ) - tmap - to set the layout of cartographic map.

    Show the code
    tmap_mode("view")
    tmap mode set to interactive viewing
    Show the code
    tm_shape(bdy_nga)+
      tm_polygons()+
      tm_view(set.zoom.limits = c(6,8))+
    
    tm_shape(dupl_shapeName)+
      tm_fill("shapeName",
              n = 6,
              style = "jenks")+
      tm_borders(alpha = 0.5)+
      tmap_style("albatross")+
      tm_layout(main.title = "Distribution of Duplicated ShapeName",
                main.title.size = 1.3,
                main.title.position = "center")
    tmap style set to "albatross"
    other available styles are: "white", "gray", "natural", "cobalt", "col_blind", "beaver", "bw", "classic", "watercolor" 
    Show the code
    tmap_mode("plot")
    tmap mode set to plotting

    Remarks :

    The plot above indicates those duplicated water points are not within the same location.

    3.3.2.4 acquire State info for duplicated value

    The State info to be combined with the duplicated “shapeName”. This will make all the shapeName unique.

    Show the code
    dupl_shapeName %>%
      mutate(st_centroid(
        dupl_shapeName$geometry, of_largest_polygon = FALSE))
    Simple feature collection with 12 features and 2 fields
    Active geometry column: geometry
    Geometry type: MULTIPOLYGON
    Dimension:     XY
    Bounding box:  xmin: 3.316459 ymin: 6.459038 xmax: 9.020704 ymax: 12.05035
    Geodetic CRS:  WGS 84
    First 10 features:
       shapeName bdy_nga$shapeName                       geometry
    1      Bassa             Bassa MULTIPOLYGON (((6.708541 7....
    2      Bassa             Bassa MULTIPOLYGON (((8.823522 10...
    3   Ifelodun          Ifelodun MULTIPOLYGON (((4.664107 8....
    4   Ifelodun          Ifelodun MULTIPOLYGON (((4.721977 7....
    5   Irepodun          Irepodun MULTIPOLYGON (((5.05493 8.0...
    6   Irepodun          Irepodun MULTIPOLYGON (((4.543349 7....
    7   Nasarawa          Nasarawa MULTIPOLYGON (((8.554589 11...
    8   Nasarawa          Nasarawa MULTIPOLYGON (((7.493228 8....
    9        Obi               Obi MULTIPOLYGON (((8.191919 6....
    10       Obi               Obi MULTIPOLYGON (((9.008576 8....
       st_centroid(dupl_shapeName$geometry, of_largest_polygon = FALSE)
    1                                         POINT (7.031827 7.791971)
    2                                         POINT (8.782946 10.08015)
    3                                         POINT (5.052235 8.544311)
    4                                         POINT (4.636735 7.920948)
    5                                         POINT (4.926215 8.169349)
    6                                          POINT (4.498797 7.84861)
    7                                         POINT (8.578262 12.00446)
    8                                         POINT (7.760272 8.304034)
    9                                         POINT (8.281026 7.022495)
    10                                         POINT (8.734777 8.35534)
    Show the code
    glimpse(dupl_shapeName)
    Rows: 12
    Columns: 3
    $ shapeName           <chr> "Bassa", "Bassa", "Ifelodun", "Ifelodun", "Irepodu…
    $ `bdy_nga$shapeName` <chr> "Bassa", "Bassa", "Ifelodun", "Ifelodun", "Irepodu…
    $ geometry            <MULTIPOLYGON [°]> MULTIPOLYGON (((6.708541 7...., MULTI…
    lga row_id headquarter state iso3166code dupl_shapeName_coord lga_office_coord
    Bassa 94 Oguma Kogi NG.KO.BA 7.791971, 7.031827 7.80932, 6.74853
    Bassa 95 Bassa Plateau NG.PL.BA 10.08015, 8.782946 10.11143, 8.71559
    Ifelodun 304 Share Kwara NG.KW.IF 8.544311, 5.052235 8.5 5.0
    Ifelodun 305 Ikirun Osun NG.OS.ID 7.920948, 4.636735 7.5 4.5
    Irepodun 355 Omu Aran Kwara NG.KW.IR 8.169349, 4.926215 8.5 5.0
    Irepodun 356 Ilobu Osun NG.OS.IP 7.84861, 4.498797 7.5 4.5
    Nasarawa 519 Bompai Kano NG.KN.NA 12.00446, 8.578262 11.5, 8.5
    Nasarawa 520 Nasarawa Nasarawa NG.NA.NA 8.304034, 7.760272 8.53477, 7.70153
    Obi 546 Obarike-Ito Benue NG.BE.OB 7.022495, 8.281026 7.01129, 8.33118
    Obi 547 Obi Nasarawa NG.NA.OB 8.35534, 8.734777 8.37944, 8.78561
    Surelere 693 Surelere Lagos NG.LA.SU 6.493217, 3.346919 6.50048, 3.35488
    Surelere 694 Iresa-Adu Oyo NG.OY.SU 8.088897, 4.393574 8.08459, 4.38538

    3.4 Data Wrangling

    3.4.1 Edit Duplicated Value :: “shapeName”

    Two (2) Main Steps involved :

    1. Combine “shapeName” with the State name to make each of them unique.
    2. Replace the “shapeName” value according to each row index.5
  • 5 Ong Z.R.J. (2022). Geospatial Analytics for Social Good - Understanding Nigeria Water functional and non-functional water point rate. https://jordan-isss624-geospatial.netlify.app/posts/geo/geospatial_exercise/#checking-of-duplicated-area-name

  • Show the code
    bdy_nga$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)] <- 
      c("Bassa Kogi",
        "Bassa Plateau",
        "Ifelodun Kwara",
        "Ifelodun Osun",
        "Irepodun Kwara",
        "Irepodun Osun",
        "Nasarawa Kano",
        "Nasarawa Nasarawa",
        "Obi Nasarawa",
        "Obi Benue",
        "Surulere Lagos",
        "Surulere Oyo")
    
    bdy_nga$shapeName[c(94,95,304,305,355,356,519,520,546,547,693,694)]
     [1] "Bassa Kogi"        "Bassa Plateau"     "Ifelodun Kwara"   
     [4] "Ifelodun Osun"     "Irepodun Kwara"    "Irepodun Osun"    
     [7] "Nasarawa Kano"     "Nasarawa Nasarawa" "Obi Nasarawa"     
    [10] "Obi Benue"         "Surulere Lagos"    "Surulere Oyo"     

    3.4.1.1 validate edited value :: “shapeName”

    Show the code
    dupl_shapeName_val <- bdy_nga %>%
      add_count(bdy_nga$shapeName) %>%
      filter(n!=1) %>%
      select(-n)
    
    dupl_shapeName_val
    Simple feature collection with 0 features and 2 fields
    Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
    Geodetic CRS:  WGS 84
    [1] shapeName         bdy_nga$shapeName geometry         
    <0 rows> (or 0-length row.names)

    3.4.2 Perform Point-in-Polygon Overlay

    Combine both attribute and boundary of the water points into a simple feature object.

    3.4.2.1 join objects :: wp_sf, bdy_nga

    Usage of the code chunk below :

    st_join( ) - sf - to join sf-class objects based on geometry, namely, wp_sf and bdy_nga.

    Show the code
    wp_joined <- st_join(x = wp_sf,
                         y = bdy_nga,
                         join = st_intersects,
                         left = TRUE)

    -- save and read RDS File :: wp_joined

    Show the code
    write_rds(wp_joined,"data/geodata/wp_joined.rds",compress = "xz")
    wp_joined <- read_rds("data/geodata/wp_joined.rds")

    3.4.2.2 inspect joined file :: wp_joined

    -- assess uniqueness of Water Point

    wp_joined %>% janitor::get_dupes(shapeName,lat_lon_deg)
    No duplicate combinations found of: shapeName, lat_lon_deg
    Simple feature collection with 0 features and 24 fields
    Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
    Geodetic CRS:  WGS 84
    # A tibble: 0 × 25
    # … with 25 variables: shapeName <chr>, lat_lon_deg <chr>, dupe_count <int>,
    #   row_id <dbl>, lat_deg <dbl>, lon_deg <dbl>, New Georeferenced Column <chr>,
    #   water_source <chr>, water_source_clean <chr>, water_source_category <chr>,
    #   water_tech_clean <chr>, water_tech_category <chr>, status_clean <chr>,
    #   status <chr>, clean_adm1 <chr>, clean_adm2 <chr>,
    #   water_point_population <dbl>, local_population_1km <dbl>,
    #   crucialness_score <dbl>, pressure_score <dbl>, usage_capacity <dbl>, …

    Remarks :

    Each water point observation is unique as there are no duplicate combination of “shapeName” together with “lat_lon_deg”.

    -- determine reference point :: “shapeName” or “clean_adm2”

    Show the code
    wp_reference <- (wp_joined$shapeName == wp_joined$clean_adm2)
    
    summary(wp_reference)
       Mode   FALSE    TRUE    NA's 
    logical   29856   65123      29 

    Remarks :

    • There are 29,713 “FALSE”, which is more than 30% of LGA names mismatched between “shapeName” and “clean_adm2”.

      • Since the geoBoundaries data is collected from government-published and reliable internet sources.6

        • Hence, the “shapeName” variable to be used throughout this study.
    • The 29 NA’s are 29 water points located beyond the LGA boundary, as shown in the plot below.

  • 6 Daniel et. al (2020) geoBoundaries: A global database of political administrative boundaries. PlosOne. https://doi.org/10.1371/journal.pone.0231866

  • Show the code
    tmap_mode("view")
    tmap mode set to interactive viewing
    Show the code
    tm_shape(bdy_nga) +
      tm_polygons() +
      tm_view(set.zoom.limits = c(5.5, 12))+
      
    tm_shape(filter(wp_joined, 
                    is.na(wp_reference))) +
      tm_dots(size=0.1,
              col="red")
    Show the code
    tmap_mode("plot")
    tmap mode set to plotting

    3.4.3 Remove Water Point Outside LGA Boundary

    wp_joined1 <- wp_joined %>% 
      filter(
        shapeName == clean_adm2 | shapeName != clean_adm2)

    3.4.4 Combine Unique Value

    There are 9 unique values for “status_clean”. However, four (4) of them share the same context :

    • “Non functional due to dry season”

    • “Non-Functional due to dry season”

    • “Abandoned”

    • “Abandoned/Decommissioned”

    Hence, the same context values need to combine into one unique value.

    wp_joined1$status_clean[wp_joined1$status_clean == "Non functional due to dry season"] <- "Non-Functional due to dry season"
    
    wp_joined1$status_clean[wp_joined1$status_clean == "Abandoned"] <- "Abandoned/Decommissioned"

    3.4.4.1 review combined output

    Show the code
    unique(wp_joined1$status_clean)
    [1] "Non-Functional"                   NA                                
    [3] "Functional"                       "Functional but needs repair"     
    [5] "Abandoned/Decommissioned"         "Functional but not in use"       
    [7] "Non-Functional due to dry season"

    3.4.4.2 save and read RDS file :: wp_joined1

    Save the updated values into wp_joined1 RDS file.

    Show the code
    write_rds(wp_joined1,"data/geodata/wp_joined1.rds",compress = "xz")
    wp_joined1 <- read_rds("data/geodata/wp_joined1.rds")

    3.5 Extract Water Point by Attribute

    3.5.1 Extract Functional Water Point :: wpt_functional

    Show the code
    wpt_functional <- wp_joined1 %>%
      filter(status_clean %in%
               c("Functional", 
                 "Functional but not in use",
                 "Functional but needs repair"))

    3.5.1.1 save and read RDS file :: wpt_functional

    Show the code
    write_rds(wpt_functional,"data/geodata/wpt_functional.rds",compress = "xz")
    wpt_functional <- read_rds("data/geodata/wpt_functional.rds")

    3.5.1.2 compute data table for clustering analysis

    Show the code
    wp_nga <- bdy_nga %>%
      mutate(`total_wp` = lengths(
        st_intersects(bdy_nga, wp_joined1))) %>%
      
      mutate(`wp_functional` = lengths(
        st_intersects(bdy_nga, wpt_functional))) %>%
      
      mutate(`pct_functional` = (`wp_functional`/`total_wp`*100))

    3.5.2 Extract Non-Functional Water Point :: wpt_nonFunctional

    Show the code
    wpt_nonFunctional <- wp_joined1 %>%
      filter(status_clean %in%
               c("Abandoned/Decommissioned", 
                 "Non-Functional",
                 "Non-Functional due to dry season"))

    3.5.2.1 save and read RDS file :: wpt_nonFuntional

    Show the code
    write_rds(wpt_nonFunctional,"data/geodata/wpt_nonFunctional.rds",compress = "xz")
    wpt_nonFunctional <- read_rds("data/geodata/wpt_nonFunctional.rds")

    3.5.2.2 compute additional variables into wp_nga

    Show the code
    wp_nga <- wp_nga %>%
      mutate(`wp_nonFunctional` = lengths(
        st_intersects(bdy_nga, wpt_nonFunctional))) %>%
      mutate(`pct_nonFunctional` = (`wp_nonFunctional`/`total_wp`*100))

    3.5.3 Extract Unknown Water Point

    Show the code
    wpt_unknown <- wp_joined1 %>%
      filter(status_clean == "Unknown")

    3.5.3.1 save and read RDS file :: wpt_unknown

    Show the code
    write_rds(wpt_unknown,"data/geodata/wpt_unknown.rds")
    wpt_unknown <- read_rds("data/geodata/wpt_unknown.rds")

    3.5.3.2 compute additional variables into wp_nga

    Show the code
    wp_nga <- wp_nga %>%
      mutate(`wp_unknown` = lengths(
        st_intersects(bdy_nga, wpt_unknown))) %>%
      mutate(`pct_unknown` = (`wp_unknown`/`total_wp`*100))

    3.5.4 Extract Water Points by “water_tech_category”

    Show the code
    freq(wp_joined1$water_tech_category)

                  var frequency percentage cumulative_perc
    1       Hand Pump     58735      61.84           61.84
    2 Mechanized Pump     25636      26.99           88.83
    3            <NA>     10054      10.59           99.42
    4        Tapstand       553       0.58          100.00
    5 Rope and Bucket         1       0.00          100.00

    Remarks :

    Only “Hand Pump”, “Mechanized Pump”, and “Tapstand” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.

    wtc_hPump <- wp_joined1 %>%
      filter(water_tech_category %in%
               "Hand Pump")
    
    wtc_mPump <- wp_joined1 %>%
      filter(water_tech_category %in%
               "Mechanized Pump")
    
    wtc_tStand <- wp_joined1 %>%
      filter(water_tech_category %in%
               "Tapstand")
    
    wp_nga <- wp_nga %>%
      mutate(`total_handPump` = lengths(
        st_intersects(bdy_nga, wtc_hPump)
      )) %>%
      mutate(`total_mechPump` = lengths(
        st_intersects(bdy_nga, wtc_mPump)
      )) %>%
        mutate(`total_tapStand` = lengths(
        st_intersects(bdy_nga, wtc_tStand)
      )) %>%
      mutate(`pct_handPump` = (`total_handPump`/`total_wp`*100)) %>%
      mutate(`pct_mechPump` = (`total_mechPump`/`total_wp`*100)) %>%
      mutate(`pct_tapStand` = (`total_tapStand`/`total_wp`*100))

    3.5.5 Extract Water Point by “usage_capacity”

    Show the code
    freq(wp_joined1$usage_capacity)

       var frequency percentage cumulative_perc
    1  300     68768      72.40           72.40
    2 1000     25636      26.99           99.39
    3  250       573       0.60           99.99
    4   50         2       0.00          100.00

    Remarks :

    • Only “300”, “1000”, and “250” are to be extracted for further analysis as the rest are either less than 0.5% or “Unknown”.

    • But, “50” will be included in the new variable “total_ucN1000” as part of the none ‘1000’ “usage_capacity” value.

    uCap_300 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "300")
    
    uCap_1000 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "1000")
    
    uCap_250 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "250")
    
    uCap_50 <- wp_joined1 %>%
      filter(usage_capacity %in%
               "50")
    
    wp_nga <- wp_nga %>%
      mutate(`total_uc300` = lengths(
        st_intersects(bdy_nga, uCap_300)
      )) %>%
      mutate(`total_uc1000` = lengths(
        st_intersects(bdy_nga, uCap_1000)
      )) %>%
      mutate(`total_uc250` = lengths(
        st_intersects(bdy_nga, uCap_250)
      )) %>%
      mutate(`total_uc50` = lengths(
        st_intersects(bdy_nga, uCap_50)
      )) %>%
      mutate(`total_ucN1000` = ((lengths(
        st_intersects(
          bdy_nga, uCap_300))) + (lengths(
            st_intersects(
              bdy_nga, uCap_250))) + (lengths(
                st_intersects(
                  bdy_nga, uCap_50))))
        )%>%
               
      mutate(`pct_ucN1000` = (`total_ucN1000`/`total_wp`*100)) %>%
      mutate(`pct_uc300` = (`total_uc300`/`total_wp`*100)) %>%
      mutate(`pct_uc1000` = (`total_uc1000`/`total_wp`*100)) %>%
      mutate(`pct_uc250` = (`total_uc250`/`total_wp`*100))

    3.5.6 Extract Water Point by “is_urban”

    urban_1 <- wp_joined1 %>%
      filter(is_urban %in%
               "TRUE")
    
    urban_0 <- wp_joined1 %>%
      filter(is_urban %in%
               "FALSE")
    
    wp_nga <- wp_nga %>%
      mutate(`total_urban1` = lengths(st_intersects
                                      (bdy_nga, urban_1)
                                      )) %>%
      mutate(`total_urban0` = lengths(st_intersects
                                      (bdy_nga, urban_0)
                                      )) %>%
      mutate(`pct_urban1` = (`total_urban1`/`total_wp`*100)) %>%
      mutate(`pct_urban0` = (`total_urban0`/`total_wp`*100))

    3.5.7 Save and Read RDS File :: wp_nga

    Show the code
    write_rds(wp_nga,"data/geodata/wp_nga.rds")
    wp_nga <- read_rds("data/geodata/wp_nga.rds")

    3.6 Transform Coordinate System

    The EPSG for wp_nga is 4326, which is WGS 84. To compute the proximity distance matrix for clustering analysis, the coordinate reference system for attribute (wp_nga) and boundary (bdy_nga) data frames needs to transform into EPSG: 26391.

    3.6.1 Transform Attribute and Boundary

    st_set_crs( ) - sf - to update the coordinate reference system.

    wp_ngaTrans <- st_set_crs(wp_nga, 26391)
    Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
    that
    bdy_ngaTrans <- st_set_crs(bdy_nga, 26391)
    Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
    that

    3.6.2 Review CRS :: wp_ngaTrans

    Usage of the code chunk below :

    st_crs( ) - sf - to inspect the coordinate reference system.

    Show the code
    st_crs(wp_ngaTrans)
    Coordinate Reference System:
      User input: EPSG:26391 
      wkt:
    PROJCRS["Minna / Nigeria West Belt",
        BASEGEOGCRS["Minna",
            DATUM["Minna",
                ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                    LENGTHUNIT["metre",1]]],
            PRIMEM["Greenwich",0,
                ANGLEUNIT["degree",0.0174532925199433]],
            ID["EPSG",4263]],
        CONVERSION["Nigeria West Belt",
            METHOD["Transverse Mercator",
                ID["EPSG",9807]],
            PARAMETER["Latitude of natural origin",4,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8801]],
            PARAMETER["Longitude of natural origin",4.5,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8802]],
            PARAMETER["Scale factor at natural origin",0.99975,
                SCALEUNIT["unity",1],
                ID["EPSG",8805]],
            PARAMETER["False easting",230738.26,
                LENGTHUNIT["metre",1],
                ID["EPSG",8806]],
            PARAMETER["False northing",0,
                LENGTHUNIT["metre",1],
                ID["EPSG",8807]]],
        CS[Cartesian,2],
            AXIS["(E)",east,
                ORDER[1],
                LENGTHUNIT["metre",1]],
            AXIS["(N)",north,
                ORDER[2],
                LENGTHUNIT["metre",1]],
        USAGE[
            SCOPE["Engineering survey, topographic mapping."],
            AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
            BBOX[3.57,2.69,13.9,6.5]],
        ID["EPSG",26391]]

    3.6.3 Review CRS :: bdy_ngaTrans

    Show the code
    st_crs(bdy_ngaTrans)
    Coordinate Reference System:
      User input: EPSG:26391 
      wkt:
    PROJCRS["Minna / Nigeria West Belt",
        BASEGEOGCRS["Minna",
            DATUM["Minna",
                ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
                    LENGTHUNIT["metre",1]]],
            PRIMEM["Greenwich",0,
                ANGLEUNIT["degree",0.0174532925199433]],
            ID["EPSG",4263]],
        CONVERSION["Nigeria West Belt",
            METHOD["Transverse Mercator",
                ID["EPSG",9807]],
            PARAMETER["Latitude of natural origin",4,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8801]],
            PARAMETER["Longitude of natural origin",4.5,
                ANGLEUNIT["degree",0.0174532925199433],
                ID["EPSG",8802]],
            PARAMETER["Scale factor at natural origin",0.99975,
                SCALEUNIT["unity",1],
                ID["EPSG",8805]],
            PARAMETER["False easting",230738.26,
                LENGTHUNIT["metre",1],
                ID["EPSG",8806]],
            PARAMETER["False northing",0,
                LENGTHUNIT["metre",1],
                ID["EPSG",8807]]],
        CS[Cartesian,2],
            AXIS["(E)",east,
                ORDER[1],
                LENGTHUNIT["metre",1]],
            AXIS["(N)",north,
                ORDER[2],
                LENGTHUNIT["metre",1]],
        USAGE[
            SCOPE["Engineering survey, topographic mapping."],
            AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
            BBOX[3.57,2.69,13.9,6.5]],
        ID["EPSG",26391]]

    3.7 Exploratory Data Analysis (EDA)

    3.7.1 Geospatial EDA

    Reference of the code chunk below source from Lin S.Y., 20227.

  • 7 Lin S.Y.(2022). Regionalisation Using Water Point Availability in Nigeria. https://lins-92-isss624.netlify.app/take-home_ex02/take-home_ex02#exploratory-data-analysis

  • Show the code
    shapeName_na <- function(x){
      tm_shape(wp_ngaTrans) +
        tm_fill(col=x,
                style="pretty") +
        tm_borders(alpha=0.5) +
        tm_layout(legend.height = 0.2, 
                  legend.width = 0.2)
    }
    
    eda_wpNA <- map(names(wp_ngaTrans)
                    [c(3,5,7,13:15,21:24,28)], 
                    shapeName_na)
    tmap_arrange(eda_wpNA)

    Remarks :

    The LGA that does not have any water points as shown above to be removed from the data frame.

    3.7.1.1 filter state without water point

    Show the code
    wp_ngaTrim <- wp_ngaTrans %>%
      filter(if_all(
        starts_with("total_wp"),~. > 0))

    3.7.1.2 visualise distribution of functional and non-functional water points

    Show the code
    fun <- tm_shape (wp_ngaTrim) +
      tm_fill("pct_functional",
              style = "jenks",
              n=6,
              title = "Functional (%)") +
      tm_layout(main.title = "Distribution of Functional Water Point by LGA",
                main.title.position = "center",
                main.title.size = 0.7,
                main.title.fontface = "bold",
                legend.height = 0.45, 
                legend.width = 0.35,
                frame = TRUE) +
      tm_borders(alpha = 0.5)
    
    nfun <- tm_shape (wp_ngaTrim) +
      tm_fill("pct_nonFunctional",
              style = "jenks",
              n=6,
              title = "Non-Functional (%)") +
      tm_layout(main.title = "Distribution of Non Functional Water Point by LGA",
                main.title.position = "center",
                main.title.size = 0.7,
                main.title.fontface = "bold",
                legend.height = 0.45, 
                legend.width = 0.35,
                frame = TRUE) +
      tm_borders(alpha = 0.5)
    
    tmap_arrange (fun, nfun, ncol = 2, asp = 1)

    3.7.1.3 save and read RDS file

    Show the code
    write_rds(wp_ngaTrim,"data/geodata/wp_ngaTrim.rds")
    wp_ngaTrim <- read_rds("data/geodata/wp_ngaTrim.rds")

    3.7.2 Visualise Distribution of Cluster Variable

    Create a data frame with variables for clustering analysis before visualise the distribution.

    3.7.2.1 create cluster variable data frame

    Show the code
    cluster_vars <- wp_ngaTrim %>%
      st_set_geometry(NULL) %>%
      select("shapeName",
             "total_wp",
             "pct_functional", 
             "pct_nonFunctional",
             "pct_handPump",
             "pct_mechPump",
             "pct_tapStand",
             "pct_uc300",
             "pct_uc1000",
             "pct_ucN1000",
             "pct_uc250",
             "pct_urban0")

    3.7.2.2 replace row ID with “shapeName”

    Show the code
    row.names(cluster_vars) <- cluster_vars$shapeName
    cluster_vars <- cluster_vars %>%
      select(-shapeName)

    3.7.2.3 examine distribution of Cluster Variables

    Usage of the code chunk below :

    st_join( ) - sf - to join sf-class objects based on geometry, namely, wp_sf and bdy_nga.

    Reference of the code chunk below source from Lin S.Y., 20228.

  • 8 Lin S.Y.(2022). Regionalisation Using Water Point Availability in Nigeria. https://lins-92-isss624.netlify.app/take-home_ex02/take-home_ex02#exploratory-data-analysis

  • Show the code
    plot_num(cluster_vars)

    3.7.3 Identify Outliers

    Boxplot shows the minimum, maximum, median, first quartile, third quartile and outliers, if any.

    Show the code
    ggarrange(
      (ggplot(data=cluster_vars, aes(x=`pct_functional`)) + 
         geom_boxplot(color="black", fill="#19ff3fFF")), 
      ((ggplot(data=cluster_vars, aes(x=`pct_nonFunctional`)) +
          geom_boxplot(color="black", fill="#ff1919FF"))),
      ((ggplot(data = cluster_vars, aes(x = `pct_handPump`)) +
          geom_boxplot(color="black", fill="#FFA319FF"))),
      ((ggplot(data = cluster_vars, aes(x = `pct_mechPump`)) +
          geom_boxplot(color="black", fill="#ff8419FF"))),
      ((ggplot(data = cluster_vars, aes(x = `pct_tapStand`)) +
          geom_boxplot(color="black", fill="#ff5619FF"))),
      ((ggplot(data = cluster_vars, aes(x = `pct_urban0`)) +
          geom_boxplot(color="black", fill="#19beffFF"))),
      ((ggplot(data = cluster_vars, aes(x = `pct_uc1000`)) +
          geom_boxplot(color="black", fill="#C16622FF"))),
      ((ggplot(data = cluster_vars, aes(x = `pct_ucN1000`)) +
          geom_boxplot(color="black", fill="#543005FF"))),
       ncol = 2,
      nrow = 4)

    Remarks :

    4. CORRELATION ANALYSIS

    4.1 Visualise Correlation Matrix

    Usage of the code chunk below :

    corrplot.mixed( ) - corrplot - to use mixed methods to visualise a correlation matrix.

    This plot allows to identify the pattern and the relationship in the matrix.

    Show the code
    corrplot.mixed((cor(cluster_vars)),
                   upper = "number",
                   lower = "ellipse",
                   tl.col = "black",
                   diag = "l",
                   tl.pos = "lt")

    Remarks :

    Following are the pairs with strong correlation :

    correlation coefficients variable_1 variable_2
    1.00 pct_uc1000 pct_ucN1000
    1.00 pct_mechPump pct_uc1000
    -1.00 pct_mechPump pct_ucN1000
    -1.00 pct_uc1000 pct_ucN1000
    0.99 pct_tapStand pct_uc250
    0.99 pct_uc300 pct_ucN1000
    -0.99 pct_mechPump pct_uc300
    -0.99 pct_uc300 pct_uc1000

    Since only “usage_capacity” and “water_tech_category” are showing multicollinearity, hence the study will be split into 2 models with water point status by -

    1) “water_tech_category”

    2) “usage_capacity”

    4.2 Refine Model :: cluster_varsTech

    4.2.1 Remove “usage_capacity”
    Show the code
    cluster_varsTech <- cluster_vars %>%
      select(-pct_ucN1000, -pct_uc1000, -pct_uc300, -pct_uc250)
    Show the code
    corrplot.mixed((cor(cluster_varsTech)),
                   upper = "number",
                   lower = "ellipse",
                   tl.col = "black",
                   diag = "l",
                   tl.pos = "lt")

    Remarks :

    “pct_handPump” and “pct_mechPump” are negatively correlated at -0.82.

    Since hand pump is the instructed main water point technology in the scope of work, “pct_mechPump” will be removed from the model.

    Show the code
    cluster_varsTech1 <- cluster_varsTech %>%
      select(-pct_mechPump)

    4.2.2 Build Regression Model

    Show the code
    model_test <- lm(total_wp ~ pct_functional + 
             pct_nonFunctional +
             pct_handPump +
             pct_tapStand +
             pct_urban0,
             data = cluster_varsTech1)
    
    summary(model_test)
    
    Call:
    lm(formula = total_wp ~ pct_functional + pct_nonFunctional + 
        pct_handPump + pct_tapStand + pct_urban0, data = cluster_varsTech1)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -215.31  -54.75  -13.98   32.67  704.69 
    
    Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
    (Intercept)        69.4043    16.7459   4.145 3.79e-05 ***
    pct_functional     -0.2828     0.2070  -1.366 0.172198    
    pct_nonFunctional  -0.7306     0.2165  -3.374 0.000779 ***
    pct_handPump        1.3936     0.1284  10.852  < 2e-16 ***
    pct_tapStand        0.2251     1.1444   0.197 0.844110    
    pct_urban0          0.3684     0.1194   3.084 0.002115 ** 
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 97.34 on 755 degrees of freedom
    Multiple R-squared:  0.2087,    Adjusted R-squared:  0.2035 
    F-statistic: 39.83 on 5 and 755 DF,  p-value: < 2.2e-16

    4.2.3 Detect Multicollinearity

    Show the code
    car::vif(model_test)
       pct_functional pct_nonFunctional      pct_handPump      pct_tapStand 
             1.899980          1.607224          1.381794          1.018837 
           pct_urban0 
             1.133960 

    Remarks :

    Alert

    When encountering the following error :

    ” Error in vif.default(model_test) : there are aliased coefficients in the mode “

    It means two or more predictor variables in the model are perfectly correlated.

    Note

    Variables with VIF threshold value between 5 to 10 may need to be cautious but VIF greater than 10 can be problematic to the model performance due to serious collinearity problem.9

  • 9 Chouiry G. (2022). What is an Acceptable Value for VIF? (With References). https://quantifyinghealth.com/vif-threshold/

  • “A VIF value of 10 means that the tolerance of the relevant independent variable is 0.10 and that 90% (𝐑𝟐) of the variable is explained by other variables. This is undesirable, as it indicates that the relevant independent variable is unnecessarily included in the model.”10

  • 10 ResearchGate. Multicollinearity issues: is a value less than 10 acceptable for VIF? https://www.researchgate.net/post/Multicollinearity_issues_is_a_value_less_than_10_acceptable_for_VIF

  • When a study involve a large sample size, the VIF threshold value can be set up to 10.

    Based on the VIF report, there is no multicollinearity in the model.

    5. DATA STANDARDISATION

    5.1 Tidy Data Frame

    Show the code
    cluster_varsTech2 <- cluster_varsTech1 %>%
      select(-total_wp)

    5.2 Standardise Data

    As shown in 3.7.3.3, not all variables are not distributed normally. Hence, standardisation will be required before the clustering analysis.

    5.2.1 Standardise :: Min-Max Method
    Show the code
    wp_stdMM <- normalize(cluster_varsTech2)
    describe(wp_stdMM)
    wp_stdMM 
    
     5  Variables      761  Observations
    --------------------------------------------------------------------------------
    pct_functional 
           n  missing distinct     Info     Mean      Gmd      .05      .10 
         761        0      619        1    0.507   0.2684   0.1591   0.2170 
         .25      .50      .75      .90      .95 
      0.3333   0.4792   0.6749   0.8611   0.9169 
    
    lowest : 0.00000000 0.01724138 0.03030303 0.05555556 0.06666667
    highest: 0.98701299 0.99029126 0.99099099 0.99593496 1.00000000
    --------------------------------------------------------------------------------
    pct_nonFunctional 
           n  missing distinct     Info     Mean      Gmd      .05      .10 
         761        0      618        1   0.3654   0.2354  0.03012  0.08333 
         .25      .50      .75      .90      .95 
     0.22111  0.35593  0.50820  0.64444  0.72727 
    
    lowest : 0.000000000 0.004065041 0.009009009 0.009708738 0.012987013
    highest: 0.852941176 0.857142857 0.864864865 0.878787879 1.000000000
    --------------------------------------------------------------------------------
    pct_handPump 
           n  missing distinct     Info     Mean      Gmd      .05      .10 
         761        0      619        1   0.4956   0.3721  0.00000  0.03125 
         .25      .50      .75      .90      .95 
     0.18605  0.52555  0.78571  0.92593  0.96000 
    
    lowest : 0.000000000 0.009009009 0.009523810 0.012987013 0.013333333
    highest: 0.989898990 0.991011236 0.994555354 0.994604317 1.000000000
    --------------------------------------------------------------------------------
    pct_tapStand 
           n  missing distinct     Info     Mean      Gmd      .05      .10 
         761        0       57    0.212  0.01792  0.03465  0.00000  0.00000 
         .25      .50      .75      .90      .95 
     0.00000  0.00000  0.00000  0.00000  0.06681 
    
    lowest : 0.00000000 0.01121771 0.01427230 0.01468599 0.01757225
    highest: 0.71373913 0.81066667 0.95000000 0.96849558 1.00000000
    --------------------------------------------------------------------------------
    pct_urban0 
           n  missing distinct     Info     Mean      Gmd      .05      .10 
         761        0      447    0.975   0.7395   0.3267   0.0000   0.1818 
         .25      .50      .75      .90      .95 
      0.5922   0.8717   1.0000   1.0000   1.0000 
    
    lowest : 0.000000000 0.008298755 0.009345794 0.010309278 0.013888889
    highest: 0.992957746 0.993730408 0.993975904 0.996138996 1.000000000
    --------------------------------------------------------------------------------

    5.2.2 Standardise :: Z-score Method

    Show the code
    wp_stdZ <- scale(cluster_varsTech2)
    describe(wp_stdZ)
    wp_stdZ 
    
     5  Variables      761  Observations
    --------------------------------------------------------------------------------
    pct_functional 
            n   missing  distinct      Info      Mean       Gmd       .05       .10 
          761         0       619         1 4.094e-17     1.142   -1.4794   -1.2332 
          .25       .50       .75       .90       .95 
      -0.7384   -0.1182    0.7143    1.5062    1.7435 
    
    lowest : -2.155978 -2.082654 -2.027105 -1.919710 -1.872456
    highest:  2.041621  2.055563  2.058539  2.079565  2.096853
    --------------------------------------------------------------------------------
    pct_nonFunctional 
             n    missing   distinct       Info       Mean        Gmd        .05 
           761          0        618          1 -1.201e-18      1.139   -1.62178 
           .10        .25        .50        .75        .90        .95 
      -1.36438   -0.69793   -0.04573    0.69082    1.34989    1.75056 
    
    lowest : -1.767485 -1.747821 -1.723906 -1.720521 -1.704663
    highest:  2.358458  2.378783  2.416136  2.483486  3.069827
    --------------------------------------------------------------------------------
    pct_handPump 
            n   missing  distinct      Info      Mean       Gmd       .05       .10 
          761         0       619         1 3.424e-18     1.151   -1.5333   -1.4366 
          .25       .50       .75       .90       .95 
      -0.9577    0.0927    0.8977    1.3315    1.4369 
    
    lowest : -1.533321 -1.505447 -1.503854 -1.493139 -1.492068
    highest:  1.529392  1.532834  1.543799  1.543951  1.560645
    --------------------------------------------------------------------------------
    pct_tapStand 
             n    missing   distinct       Info       Mean        Gmd        .05 
           761          0         57      0.212 -1.375e-17      0.366    -0.1892 
           .10        .25        .50        .75        .90        .95 
       -0.1892    -0.1892    -0.1892    -0.1892    -0.1892     0.5165 
    
    lowest : -0.189230605 -0.070747233 -0.038484155 -0.034114693 -0.003629485
    highest:  7.349402711  8.373167729  9.844829943 10.040183334 10.372938392
    --------------------------------------------------------------------------------
    pct_urban0 
            n   missing  distinct      Info      Mean       Gmd       .05       .10 
          761         0       447     0.975 6.197e-17     1.038   -2.3488   -1.7713 
          .25       .50       .75       .90       .95 
      -0.4677    0.4200    0.8274    0.8274    0.8274 
    
    lowest : -2.3488119 -2.3224529 -2.3191273 -2.3160670 -2.3046972
    highest:  0.8050784  0.8075325  0.8083123  0.8151828  0.8274464
    --------------------------------------------------------------------------------

    Remarks :

    Comparing the reports above, the Min-Max method is the only method that can standardise the value to between 0 and 1.

    5.2.3 Compare Distribution For Standardisation Method

    Visualise to determine which standardisation method provide the better output.

    Show the code
    ggarrange(
      (ggplot(data = cluster_varsTech2, aes(x = `pct_functional`)) +
        geom_density(color = "black", fill = "#19ff3fFF") + 
        ggtitle("Before Standardisation")),
      (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_functional`)) +
          geom_density(color = "black", fill = "#19ff3fFF") +
          ggtitle("Min-Max Stdsn.")),
      (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_functional`)) +
         geom_density(color = "black", fill="#19ff3fFF") +
         ggtitle("Z-score Stdsn.")),
       ncol = 3)

    Show the code
    ggarrange(
      (ggplot(data = cluster_varsTech2, aes(x = `pct_nonFunctional`)) +
        geom_density(color = "black", fill = "#ff1919FF") + 
        ggtitle("Before Standardisation")),
      (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_nonFunctional`)) +
          geom_density(color = "black", fill = "#ff1919FF") +
          ggtitle("Min-Max Stdsn.")),
      (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_nonFunctional`)) +
         geom_density(color = "black", fill="#ff1919FF") +
         ggtitle("Z-score Stdsn.")),
       ncol = 3)

    Show the code
    ggarrange(
      (ggplot(data = cluster_varsTech2, aes(x = `pct_handPump`)) +
        geom_density(color = "black", fill = "#FFA319FF") + 
        ggtitle("Before Standardisation")),
      (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_handPump`)) +
          geom_density(color = "black", fill = "#FFA319FF") +
          ggtitle("Min-Max Stdsn.")),
      (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_handPump`)) +
         geom_density(color = "black", fill="#FFA319FF") +
         ggtitle("Z-score Stdsn.")),
       ncol = 3)

    Show the code
    ggarrange(
      (ggplot(data = cluster_varsTech2, aes(x = `pct_tapStand`)) +
        geom_density(color = "black", fill = "#ff5619FF") + 
        ggtitle("Before Standardisation")),
      (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_tapStand`)) +
          geom_density(color = "black", fill = "#ff5619FF") +
          ggtitle("Min-Max Stdsn.")),
      (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_tapStand`)) +
         geom_density(color = "black", fill="#ff5619FF") +
         ggtitle("Z-score Stdsn.")),
       ncol = 3)

    Show the code
    ggarrange(
      (ggplot(data = cluster_varsTech2, aes(x = `pct_urban0`)) +
        geom_density(color = "black", fill = "#19beffFF") + 
        ggtitle("Before Standardisation")),
      (ggplot(data = (as.data.frame(wp_stdMM)), aes(x = `pct_urban0`)) +
          geom_density(color = "black", fill = "#19beffFF") +
          ggtitle("Min-Max Stdsn.")),
      (ggplot(data = (as.data.frame(wp_stdZ)), aes(x = `pct_urban0`)) +
         geom_density(color = "black", fill="#19beffFF") +
         ggtitle("Z-score Stdsn.")),
       ncol = 3)

    6. CLUSTERING ANALYSIS

    6.1 Hierarchical Clustering

    There are four (4) main steps :

    • compute proximity matrix and determine clustering algorithm.
    • compute Hierarchical Clustering.
    • identify optimal number of cluster and merge similar clusters.
    • update the distance matrix.

    6.1.1 Compute Proximity Matrix and Clustering Algorithm

    First determine the clustering algorithm before compute Hierarchical Clustering analysis.

    6.1.1.1 determine Hierarchical Clustering algorithm

    Usage of the code chunk below :

    agnes( ) - cluster - to get agglomerative coefficient of 4 clustering structure, namely “average”, “single”, “complete” and “ward”.

    Show the code
    m <- c("average","single","complete","ward")
    
    names(m) <- c("average","single","complete","ward")
    
    ac <- function(x){agnes(wp_stdMM, method = x)$ac}
    
    map_dbl(m, ac)
      average    single  complete      ward 
    0.9287791 0.8321098 0.9529544 0.9918906 

    Remarks :

    • Value 1 indicates the strongest clustering structure.

    • Ward’s method provides the strongest clustering structure. Therefore, Ward’s method is to be used in subsequent analysis.

    6.1.1.2 compute proximity matrix

    Usage of the code chunk below :

    dist( ) - stats - to compute the proximity distance matrix. Among euclidean, maximum, manhattan, canberra, binary and minkowski, euclidean is used to compute proxmat_euc.

    Show the code
    prox_mat_euc <- dist(wp_stdMM, 
                         method = 'euclidean')

    6.1.2 Compute Hierarchical Clustering

    Usage of the code chunk below :

    hclust( ) - stats - to compute cluster with agglomeration method.

    ggdendrogram( ) - ggdendro - to plot dendrogram with tools available in ggplot2.

    Show the code
    hclust_ward <- hclust(prox_mat_euc, 
                          method = 'ward.D')
    ggdendrogram(hclust_ward,
                 rotate = TRUE,
                 cex = 0.1,
                 theme_dendro = FALSE)

    Remarks :

    This is in spite of adjusting the cex= parameter that scales the resolution of the dendogram to 10%. 11

  • 11 Chua Y.T. (2022). Take-home Exercise 2 - Regionalisation of Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods. https://isss624-amelia.netlify.app/exercises/take-home_ex2/take-home_ex2#computing-hierarchical-clustering

  • 6.1.3 Identify Number of Optimal Cluster

    To determine the optimal clusters to retain, following commons methods are tested :

    • Gap statistic

    • Elbow

    • Average Silhouette

    6.1.3.1 compute Gap Statistic method

    Usage of the code chunk below :

    clusGap( ) - cluster - to compute the gap statistic.

    Show the code
    set.seed(12345)
    
    gap_stat <- clusGap(wp_stdMM, 
                        FUN = hcut, 
                        nstart = 25, 
                        K.max = 7, 
                        B = 50)
    
    print(gap_stat, method = "firstmax")
    Clustering Gap statistic ["clusGap"] from call:
    clusGap(x = wp_stdMM, FUNcluster = hcut, K.max = 7, B = 50, nstart = 25)
    B=50 simulated reference sets, k = 1..7; spaceH0="scaledPCA"
     --> Number of clusters (method 'firstmax'): 5
             logW   E.logW       gap      SE.sim
    [1,] 4.923949 5.477941 0.5539917 0.007618469
    [2,] 4.731938 5.364838 0.6329008 0.012434270
    [3,] 4.617999 5.286602 0.6686036 0.010528780
    [4,] 4.497712 5.221510 0.7237980 0.011597296
    [5,] 4.419187 5.168516 0.7493295 0.011180606
    [6,] 4.379879 5.123591 0.7437121 0.010223152
    [7,] 4.336848 5.085791 0.7489429 0.010297232

    Remarks :

    The number of clusters recommended by “firstmax” approach of Gap Statistic method is 5.

    6.1.3.2 visualise gap_stat

    Usage of the code chunk below :

    fviz_nbclust( ) - factoextra - to compute and visualise the Optimal Number of clusters.

    Show the code
    set.seed(12345)
    
    fviz_nbclust(wp_stdMM,
                 FUNcluster = hcut,
                 nstart = 25,  
                 method = "gap_stat", 
                 nboot = 50,
                 linecolor = "white")+
      theme_dark() +
      labs(subtitle = "Gap statistic method")

    6.1.3.3 compute and visualise Total Within Sum of Squares (Elbow) method

    Show the code
    fviz_nbclust(wp_stdMM, 
                 kmeans, 
                 method = "wss",
                 linecolor = "white")+
      theme_dark() +
      labs(subtitle = "Elbow method")

    6.1.3.4 compute and visualise Silhouette method

    Show the code
    fviz_nbclust(wp_stdMM, 
                 kmeans, 
                 method = "silhouette",
                 linecolor = "white") +
      theme_dark() +
      labs(subtitle = "Silhouette method")

    Remarks :

    Method Gap stat Elbow Silhouette
    Optimal Value for K 5 3 3

    Given the Elbow method, Silhouette method and Gap Statistic method, the 5-cluster by Silhouette method will be used for the rest of the study.

    6.1.4 Merge Similar Clusters

    Usage of the code chunk below :

    rect.hclust( ) - stats - to draw the dendrogram with a border around the selected clusters.

    Show the code
    plot(hclust_ward, cex = 0.5)
    rect.hclust(hclust_ward, 
                k = 5, 
                border = 2:5)

    6.1.5 Visually-Driven Hierarchical Clustering Analysis

    The data is loaded into a data frame, but it has to be a data matrix to plot the heatmap. Hence, the data frame will need to first transform into a matrix.

    6.1.5.1 transform data frame into matrix

    Usage of the code chunk below :

    data.matrix( ) - base - to transform cluster_varsTrim data frame into a data matrix, and named it as nga_clustMat.

    Show the code
    wp_stdMM_mat <- data.matrix(wp_stdMM)

    6.1.5.2 plot interactive cluster heatmap

    Usage of the code chunk below :

    heatmaply( ) - heatmaply - to build an interactive cluster heatmap.

    Show the code
    heatmaply(normalize(wp_stdMM_mat),
              Colv = NA,
              dist_method = "euclidean",
              hclust_method = "ward.D",
              seriate = "OLO",
              colors = Blues,
              k_row = 5,
              margins = c(NA,200,60,NA),
              fontsize_row = 1,
              fontsize_col = 5,
              main="Geographic Segmentation of Nigeria by Water Points",
              xlab = "Water Points",
              ylab = "Nigeria LGA"
              )

    Remarks :

    Cluster 4 and 5 have higher percentage of functional water points.

    Cluster 1, 2 and 3 have higher percentage of non-functional water points.

    Cluster 1, 2, 4 and 5 have higher percentage of hand pump deployed for the water points.

    Cluster 2, 3 and 5 have higher percentage of rural communities.

    6.1.5.3 create 5-cluster model

    Usage of the code chunk below :

    cutree( ) - base - to derive a 5-cluster model, and named the output as groups.

    Show the code
    groups <- as.factor(cutree(hclust_ward, k = 5))

    6.1.5.4 append groups to wp_ngaTrim

    Show the code
    nga_clust.sf <- cbind(wp_ngaTrim, as.matrix(groups)) %>%
      rename(`cluster`=`as.matrix.groups.`)

    6.1.5.5 plot choropleth map :: nga_clust.sf

    Show the code
    clusGeo.map <- tm_shape(nga_clust.sf) +
      tm_fill(col = "cluster",
              title = "Cluster") +
      tm_borders(alpha = 0.3) +
      tm_layout(main.title = "Non-Spatial with \n Hierarchical Clustering",
                main.title.size = 1.2,
                main.title.position = "center",
                legend.height = 0.4,
                legend.width = 0.2, 
                legend.title.size = 2,
                legend.text.size = 2,
                frame = TRUE,
                asp = 0)
    
    clusGeo.map

    Remarks :

    The choropleth map above shows the fragmented clusters by the used of non-spatial clustering algorithm (hierarchical cluster analysis method).

    6.2 Spatially Constrained Clustering :: SKATER Approach

    SKATER function only support sp objects in SpatialPolygonDataFrame. Hence, the wp_ngaTrans has to first transform into SpatialPolygonDataFrame before proceed further.

    6.2.1 Convert SF to SP Data Frame

    Usage of the code chunk below :

    as_Spatial( ) - sf - to convert wp_ngaTrans into nga_sp in a SP data frame.

    Show the code
    nga.sp <- as_Spatial(wp_ngaTrim)

    6.2.2 Compute Neighbour List

    First compute the neighbour list before plot it.

    6.2.2.1 compute neighbour list from polygon list

    Usage of the code chunk below :

    poly2nb( ) - spdep - to compute the neighbours list from polygon list.

    Show the code
    nga.nb <- poly2nb(nga.sp, queen = TRUE)
    
    summary(nga.nb)
    Neighbour list object:
    Number of regions: 761 
    Number of nonzero links: 4348 
    Percentage nonzero weights: 0.750793 
    Average number of links: 5.713535 
    Link number distribution:
    
      1   2   3   4   5   6   7   8   9  10  11  12  14 
      4  16  57 122 177 137 121  71  39  11   4   1   1 
    4 least connected regions:
    136 497 513 547 with 1 link
    1 most connected region:
    496 with 14 links

    Remarks :

    There is no LGA without link.

    6.2.2.2 plot Neighbour List by Centroid Node

    Usage of the code chunk below : plot the boundary first before the neighbour list object to avoid any region from being clipped away.

    Show the code
    plot(nga.sp, 
         border = grey(.5))
    
    plot(nga.nb, 
         coordinates(nga.sp), 
         col = "blue", 
         add = TRUE)

    6.2.3 Compute Minimum Spanning Tree (MST)

    To find a minimum path connecting all nodes in a graph, a minimum spanning tree with minimum weight than all other spanning trees to be used in subsequent steps.

    6.2.3.1 calculate edge costs

    Usage of the code chunk below :

    nbcosts( ) - spdep - to compute the cost of each edge which is the distance between nodes.

    Show the code
    edge_cost <- nbcosts(nga.nb, wp_stdMM)

    6.2.3.2 specify spatial weight

    nb2listw( ) - spdep - to specify edge_cost as the spatial weights. Set the “style” to “B” to ensure the cost values are not row-standardised.

    Show the code
    nga.w <- nb2listw(nga.nb,
                      edge_cost,
                      style = "B")
    summary(nga.w)
    Characteristics of weights list object:
    Neighbour list object:
    Number of regions: 761 
    Number of nonzero links: 4348 
    Percentage nonzero weights: 0.750793 
    Average number of links: 5.713535 
    Link number distribution:
    
      1   2   3   4   5   6   7   8   9  10  11  12  14 
      4  16  57 122 177 137 121  71  39  11   4   1   1 
    4 least connected regions:
    136 497 513 547 with 1 link
    1 most connected region:
    496 with 14 links
    
    Weights style: B 
    Weights constants summary:
        n     nn       S0       S1       S2
    B 761 579121 1902.565 2276.999 23657.73

    6.2.3.3 compute MST

    Usage of the code chunk below :

    nbcosts( ) - spdep - to compute the minimum spanning tree.

    Show the code
    nga_minSpanT <- mstree(nga.w)

    6.2.3.4 review class and dimension of the computed MST

    Show the code
    class(nga_minSpanT)
    [1] "mst"    "matrix"
    Show the code
    dim(nga_minSpanT)
    [1] 760   3
    Show the code
    head(nga_minSpanT)
         [,1] [,2]       [,3]
    [1,]  109  517 0.11534401
    [2,]  517  685 0.12322370
    [3,]  685  678 0.10131893
    [4,]  685   46 0.13142212
    [5,]  517  229 0.15873997
    [6,]  229  260 0.08560336

    6.2.3.5 plot MST Neighbour List

    Show the code
    plot(nga.sp, border = gray(.5))
    
    plot.mst(nga_minSpanT,
             coordinates(nga.sp), 
             col = "blue", 
             cex.lab = 0.7, 
             cex.circles = 0.005, 
             add = TRUE)

    6.2.4 Compute Spatially Constrained Cluster

    Usage of the code chunk below :

    skater( ) - spdep - to compute the spatially constrained cluster.

    note: ncuts = number of clusters - 1

    Show the code
    clust5 <- spdep::skater(edges = nga_minSpanT[,1:2],
                            data = cluster_varsTech2,
                            method = "euclidean",
                            ncuts = 4)
    str(clust5)
    List of 8
     $ groups      : num [1:761] 5 5 2 5 2 4 4 2 3 3 ...
     $ edges.groups:List of 5
      ..$ :List of 3
      .. ..$ node: num [1:256] 87 668 49 683 695 229 517 109 164 32 ...
      .. ..$ edge: num [1:255, 1:3] 651 639 761 735 660 429 482 679 146 112 ...
      .. ..$ ssw : num 10261
      ..$ :List of 3
      .. ..$ node: num [1:288] 324 209 208 556 197 408 458 62 88 752 ...
      .. ..$ edge: num [1:287, 1:3] 209 208 197 430 92 30 428 3 508 729 ...
      .. ..$ ssw : num 12090
      ..$ :List of 3
      .. ..$ node: num [1:80] 332 362 715 101 605 566 22 524 187 288 ...
      .. ..$ edge: num [1:79, 1:3] 551 25 589 590 549 10 42 287 155 187 ...
      .. ..$ ssw : num 3447
      ..$ :List of 3
      .. ..$ node: num [1:77] 363 81 14 567 27 171 346 28 174 212 ...
      .. ..$ edge: num [1:76, 1:3] 567 14 81 346 50 174 48 212 531 363 ...
      .. ..$ ssw : num 3356
      ..$ :List of 3
      .. ..$ node: num [1:60] 102 546 33 282 325 369 538 201 720 525 ...
      .. ..$ edge: num [1:59, 1:3] 611 574 358 600 620 358 359 2 135 538 ...
      .. ..$ ssw : num 1933
     $ not.prune   : NULL
     $ candidates  : int [1:5] 1 2 3 4 5
     $ ssto        : num 39075
     $ ssw         : num [1:5] 39075 34440 33199 31933 31088
     $ crit        : num [1:2] 1 Inf
     $ vec.crit    : num [1:761] 1 1 1 1 1 1 1 1 1 1 ...
     - attr(*, "class")= chr "skater"

    6.2.4.1 tabulate cluster assignment

    Show the code
    ccs5 <- clust5$groups
    table(ccs5)
    ccs5
      1   2   3   4   5 
    256 288  80  77  60 

    6.2.4.2 plot the pruned tree

    Show the code
    plot(nga.sp, border = gray(.5))
    plot(clust5, 
         coordinates(nga.sp), 
         cex.lab = .7,
         groups.colors = c("red","green","blue", "brown"),
         cex.circles = 0.005, 
         add = TRUE)
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter
    
    Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
    coords[id2, : "add" is not a graphical parameter

    6.2.5 Visualise SKATER Clusters in Choropleth Map

    Show the code
    groups_mat <- as.matrix(clust5$groups)
    
    nga_spClust.sf <- cbind(nga_clust.sf, 
                            as.factor(groups_mat)) %>%
      rename(`sp_cluster`=`as.factor.groups_mat.`)

    To compare the output of hierarchical clustering and spatially constrained hierarchical clustering :

    Show the code
    clusGeo_SKAT.map <- tm_shape(nga_spClust.sf) +
      tm_fill(col = "sp_cluster",
              title = "Cluster") +
      tm_borders(alpha = 0.3) +
      tm_layout(main.title = "Spatially Constrained \n with SKATER Method",
                main.title.size = 1.2,
                main.title.position = "center",
                legend.height = 0.4,
                legend.width = 0.2, 
                legend.title.size = 2,
                legend.text.size = 2,
                frame = TRUE,
                asp = 0)
    
    clusGeo_SKAT.map

    6.3 ClustGeo Method

    This section consists of two (2) parts i.e. spatially and non-spatially Constrained Cluster analysis.

    6.3.1 Non-Spatially Constrained Hierarchical Cluster Analysis

    Dissimilarity matrix must be an object of class dist.

    6.3.1.1 create Class dist Object

    Show the code
    proxmat_ngc <- dist(wp_stdMM, method = 'euclidean')

    6.3.1.2 compute Non-Spatially Constrained Hierarchical Clustering

    Usage of the code chunk below :

    hclustgeo( ) - ClustGeo - to perform a typical Ward-like hierarchical clustering.

    Show the code
    nonGeo_clust <- hclustgeo(proxmat_ngc)
    plot(nonGeo_clust, 
         cex = 0.5)
    rect.hclust(nonGeo_clust, 
                k = 5, 
                border = 2:5)

    6.3.1.3 derive 5-cluster model

    Show the code
    groups_ngc <- as.factor(cutree(nonGeo_clust, 
                                   k = 5))

    6.3.1.4 combine groups_ngc with wp_ngaTrim

    Show the code
    nga_ngeo_clust.sf <- cbind(wp_ngaTrim, as.matrix(groups_ngc)) %>%
      rename(`cluster` = `as.matrix.groups_ngc.`)

    6.3.1.5 visualise Non-Spatially Constrained Hierarchical Cluster

    Show the code
    clusGeo_nSp.map <- tm_shape(nga_ngeo_clust.sf) +
      tm_fill(col = "cluster",
              title = "Cluster") +
      tm_borders(alpha = 0.3) +
      tm_layout(main.title = "Non-Spatially Constrained \n with ClustGeo Method",
                main.title.size = 1.2,
                main.title.position = "center",
                legend.height = 0.4,
                legend.width = 0.2, 
                legend.title.size = 2,
                legend.text.size = 2,
                frame = TRUE,
                asp = 0)
    
    clusGeo_nSp.map

    6.3.2 Spatially Constrained Hierarchical Cluster Analysis

    6.3.2.1 determine Spatial Distance Matrix

    Usage of the code chunk below :

    st_distance( ) - sf - to derive the spatial distance matrix before perform spatially constrained hierarchical clustering.

    as.dist( ) - stats - to convert the data frame into matrix.

    dist <- st_distance(wp_ngaTrim, wp_ngaTrim)
    dist_mat <- as.dist(dist)

    6.3.2.2 determine Alpha value

    choicealpha( ) - psych - to determine a suitable value for the mixing parameter alpha.

    cr <- choicealpha(
      proxmat_ngc, 
      dist_mat, 
      range.alpha = seq(0, 1, 0.1), 
      K = 5, 
      graph = TRUE)

    Remarks :

    With reference to the plot above, alpha = 0.5 to be used to perform spatially constrained hierarchical clustering.

    6.3.2.3 compute Spatially Constrained Hierarchical Clustering

    Show the code
    clustG <- hclustgeo(proxmat_ngc, 
                        dist_mat, 
                        alpha = 0.5)

    6.3.2.4 derive “cluster” Object

    Show the code
    groups_cg <- as.factor(cutree(clustG, k = 5))

    6.3.2.5 combine groups_cg with wp_ngaTrim

    Show the code
    wp_nga_clustG <- cbind(wp_ngaTrim, as.matrix(groups_cg)) %>%
      rename(`cluster` = `as.matrix.groups_cg.`)

    6.3.2.6 Visualise Spatially Constrained Hierarchical Clustering

    Show the code
    clusGeo_sp.map <- tm_shape(wp_nga_clustG) +
      tm_fill(col = "cluster",
              title = "Cluster") +
      tm_borders(alpha = 0.3) +
      tm_layout(main.title = "Spatially Constrained \n with ClustGeo Method",
                main.title.size = 1.2,
                main.title.position = "center",
                legend.height = 0.4,
                legend.width = 0.2, 
                legend.title.size = 2,
                legend.text.size = 2,
                frame = TRUE,
                asp = 0)
    
    clusGeo_sp.map

    6.4 Spatially Constrained Clustering :: RedCap Method

    6.4.1 Derive Queen Contiguity Spatial Weights

    Usage of the code chunk below :

    queen_weights( ) - rgeoda - to create a Queen contiguity weights.

    Show the code
    nga_queenW <- queen_weights(wp_ngaTrim)
    
    nga_queenW
    Reference class object of class "Weight"
    Field "gda_w":
    An object of class "p_GeoDaWeight"
    Slot "pointer":
    <pointer: 0x000001b7c5f02d20>
    
    Field "is_symmetric":
    [1] TRUE
    Field "sparsity":
    [1] 0.00750793
    Field "min_neighbors":
    [1] 1
    Field "max_neighbors":
    [1] 14
    Field "num_obs":
    [1] 761
    Field "mean_neighbors":
    [1] 5.713535
    Field "median_neighbors":
    [1] 6
    Field "has_isolates":
    [1] FALSE

    6.4.2 Compute Spatially Constrained Cluster

    Usage of the code chunk below :

    redcap( ) - rgeoda - to compute spatially constrained cluster based on three (3) mandatory arguments:-

    • k, the number of clusters to form

    • w, an instance of Weight class

    • df, data frame with cluster variables

    Show the code
    set.seed(12345)
    
    clust_rCap <- redcap(5, 
                         nga_queenW, 
                         wp_stdMM, 
           method = "fullorder-singlelinkage",
           scale_method = 'raw',
           distance_method = "euclidean",
           random_seed = 12345)
    
    str(clust_rCap)
    List of 5
     $ Clusters                                    : int [1:761] 2 2 2 2 2 1 1 2 3 2 ...
     $ Total sum of squares                        : num 3800
     $ Within-cluster sum of squares               : num [1:5] 500 640 397 734 648
     $ Total within-cluster sum of squares         : num 880
     $ The ratio of between to total sum of squares: num 0.232

    6.4.3 Reveal Cluster Assignment

    Show the code
    redCap_cluster <- clust_rCap$Cluster
    
    table(redCap_cluster)
    redCap_cluster
      1   2   3   4   5 
    342 339  50  20  10 

    6.4.4 Visualise Spatially Constrained Cluster

    Show the code
    redCap_mat <- as.matrix(redCap_cluster)
    
    wp_nga_redCap <- cbind(wp_ngaTrim, as.factor(redCap_mat)) %>%
      rename(`cluster`=`as.factor.redCap_mat.`)
    Show the code
    redCap_sp.map <- tm_shape(wp_nga_redCap) +
      tm_fill(col = "cluster",
              title = "Cluster") +
      tm_borders(alpha = 0.3) +
      tm_layout(main.title = "Spatially Constrained \n with RedCap Method",
                main.title.size = 1.2,
                main.title.position = "center",
                legend.height = 0.4,
                legend.width = 0.2, 
                legend.title.size = 2,
                legend.text.size = 2,
                frame = TRUE,
                asp = 0)
    
    redCap_sp.map

    7. VISUAL INTERPRETATION

    7.1 Visualise Individual Clustering Variable

    7.1.1 Plot Boxplot

    Show the code
    ggplot(data = nga_ngeo_clust.sf,
           aes(x = cluster, y = pct_nonFunctional)) +
      geom_boxplot()

    Remarks :

    The boxplot reveals Cluster 3 displays the highest mean of non-functional water points. This is followed by Cluster 2.

    7.1.2 Visualise Multivariate

    Create a separate data frame to ensure key variables are included.

    7.1.2.1 prepare data frame

    nga_ngeo_clust.sf1 <- nga_ngeo_clust.sf %>%
      select("shapeName", 
             "pct_functional", 
             "pct_nonFunctional",
             "pct_handPump", 
             "pct_tapStand",
             "pct_urban0",
             "cluster")
             
    head(nga_ngeo_clust.sf1,3)
    Simple feature collection with 3 features and 7 fields
    Geometry type: MULTIPOLYGON
    Dimension:     XY
    Bounding box:  xmin: 6.778522 ymin: 5.052192 xmax: 7.402708 ymax: 9.232154
    Projected CRS: Minna / Nigeria West Belt
      shapeName pct_functional pct_nonFunctional pct_handPump pct_tapStand
    1 Aba North       41.17647          52.94118    11.764706            0
    2 Aba South       40.84507          49.29577     9.859155            0
    3     Abaji       40.35088          59.64912    40.350877            0
      pct_urban0 cluster                       geometry
    1   0.000000       1 MULTIPOLYGON (((7.401109 5....
    2   5.633803       1 MULTIPOLYGON (((7.334479 5....
    3  84.210526       2 MULTIPOLYGON (((7.045872 9....

    7.1.2.2 plot parallel coordinate plot

    Usage of the code chunk below :

    ggparcoord( ) - GGally - to plot static parallel coordinate plots to reveal distribution of variables by cluster.

    • scale is a character string with the following options -

      • std : univariately, subtract mean & divide by standard deviation.

      • robust : univariately, subtract median & divide by median absolute deviation.

      • uniminmax : univariately, scale the minimum to 0, the maximum to 1.

      • globalminmax : no scaling; the range of the graphs is defined by the global minimum and the global maximum.

      • center : use uniminmax to standardize vertical height, then center each variable at a value specified by the scaleSummary param.

      • centerObs : use uniminmax to standardize vertical height, then center each variable at the value of the observation specified by the centerObsID param

    ggparcoord(data = nga_ngeo_clust.sf1[
      nga_ngeo_clust.sf1$cluster %in% c(2:6),],
               columns = c(2:6),
               scale = "uniminmax",
               alphaLines = 0.2,
               boxplot = TRUE,
               groupColumn = 3) +
      labs(title = "Visual Clustering for ClustGeo Method Spatially Constrained Clustering",
           subtitle = "Multiple Parallel Coordinates Plot") +
      scale_color_viridis(discrete = FALSE) +
      facet_grid(~ cluster,) +
      theme(axis.text.x = element_text(angle = 90),
            text = element_text(size = 12))
    Warning: The following aesthetics were dropped during statistical transformation: colour
    ℹ This can happen when ggplot fails to infer the correct grouping structure in
      the data.
    ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
      variable into a factor?
    The following aesthetics were dropped during statistical transformation: colour
    ℹ This can happen when ggplot fails to infer the correct grouping structure in
      the data.
    ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
      variable into a factor?
    The following aesthetics were dropped during statistical transformation: colour
    ℹ This can happen when ggplot fails to infer the correct grouping structure in
      the data.
    ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
      variable into a factor?
    The following aesthetics were dropped during statistical transformation: colour
    ℹ This can happen when ggplot fails to infer the correct grouping structure in
      the data.
    ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
      variable into a factor?

    Remarks :

    Based on the parallel coordinate plot above, insights for the stakeholder or decision makers from :

    • the Federal Ministry of Agriculture & Rural Development (FMARD) -

      • Cluster 2 LGA has the highest percentage of non-functional water points, followed by Cluster 1 LGA.

      • Cluster 2 also has the highest percentage of rural communities.

    !!!!! 1 cluster is missing. Evaluation may be misled.

    7.2 Compare Clustering Method Visually

    Show the code
    tmap_arrange(clusGeo.map, 
                 clusGeo_SKAT.map, 
                 clusGeo_nSp.map, 
                 clusGeo_sp.map, 
                 redCap_sp.map,
                 ncol = 2)

    Remarks :

    The output by both Hierarchical Clustering and Hierarchical Clustering by ClustGeo method are relatively similar.

    However, for spatially constrained clustering, ClustGeo method produce a desirable output compare to SKATER and RedCap methods.

    8. CONCLUSION

    8.1 Conclude with Summary Statistics

    nga_ngeo_clust.sf1 %>% 
      st_set_geometry(NULL) %>%
      group_by(cluster) %>%
      summarise(mean_pct_functional = mean(pct_functional),
                mean_pct_nonFunctional = mean(pct_nonFunctional),
                mean_pct_handPump = mean(pct_handPump), 
                mean_pct_tapStand = mean(pct_tapStand), 
                mean_pct_urban0 = mean(pct_urban0))
    # A tibble: 5 × 6
      cluster mean_pct_functional mean_pct_nonFunctional mean_pct_…¹ mean_…² mean_…³
      <chr>                 <dbl>                  <dbl>       <dbl>   <dbl>   <dbl>
    1 1                      38.8                   30.1        7.50  0.274     57.3
    2 2                      43.6                   42.7       62.1   1.33      88.8
    3 3                      33.3                   62.3       22.2   0.0885    94.4
    4 4                      53.7                   29.6       45.2   0.582     16.4
    5 5                      79.9                   19.7       83.7   0.0567    88.9
    # … with abbreviated variable names ¹​mean_pct_handPump, ²​mean_pct_tapStand,
    #   ³​mean_pct_urban0

    Remarks :

    Non-Functional Water Point

    • More than 90% of the water points in Cluster 3 are within non-urban communities, and approximately 62% of these water points are non-functional.

    • Although Cluster 2 and Cluster 5 have similar percentage of approximately 88% water points falls within non-urban communities, the non-functional water points in Cluster 2 is more than twice the amount in Cluster 5 LGA.

      • Resources may required to attain more information regarding the stakeholder who task to maintain the water points that identified as non-functional but having status indicated otherwise.

      • !!!! Analysis need to include parameter such as “status” to support the statement above.

    • Cluster 1, however, may require more resources to update the water point data. Unlike Cluster 2, 3, and 5 where more than 85% of water points are updated with data, only two-third of Cluster 1’s water points status and various attributes are known.